January 13, 2016
This document creates a model object using lm() using Landsat band reflectance values as predictors for tree cover (VCF). Using the resulting model object, VCF values are predicted for the Gewata area.
Set the proper input and output folder
inFolder <- '/home/user/Projects/AssignmentLesson8/data'
outFolder <- '/home/user/Projects/AssignmentLesson8/output'
library(raster)
## Loading required package: sp
Load all the data, put it into a brick and give them proper names
gewata_list <- list.files(inFolder, pattern = glob2rx('*Gewata*.rda'), full.names = TRUE)
for(i in 1:length(gewata_list)){
load(gewata_list[i])
}
Cut the clouds and water from the VCF file
vcfGewata[vcfGewata > 100] <- NA
alldata <- brick(GewataB1, GewataB2, GewataB3, GewataB4, GewataB5, GewataB7, vcfGewata)
names(alldata) <- c("band1", "band2", "band3", "band4", "band5", "band7", "VCF")
plot(alldata)
# Extract all data to a dataframe
df <- as.data.frame(getValues(alldata))
Make plots for every band against the VCF layer
for(i in 1:6){
pairs(alldata[[c(i,7)]], maxpixels=10000)
}
Make the linear model. We decided to leave out band 7 because there were some holes in the image.
model <- lm(VCF ~ band1 + band2 + band3 + band4 + band5, data = df)
summary(model)
##
## Call:
## lm(formula = VCF ~ band1 + band2 + band3 + band4 + band5, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -58.204 -4.636 0.715 5.211 258.586
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.550e+01 5.601e-02 1526.5 <2e-16 ***
## band1 9.291e-02 1.876e-04 495.3 <2e-16 ***
## band2 -1.505e-01 2.314e-04 -650.5 <2e-16 ***
## band3 -2.504e-03 1.473e-04 -17.0 <2e-16 ***
## band4 1.661e-02 2.443e-05 679.9 <2e-16 ***
## band5 -2.004e-02 4.158e-05 -482.0 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.606 on 1808278 degrees of freedom
## (13712 observations deleted due to missingness)
## Multiple R-squared: 0.8582, Adjusted R-squared: 0.8582
## F-statistic: 2.189e+06 on 5 and 1808278 DF, p-value: < 2.2e-16
You can see that the coefficients are significant for each layer. When band 7 is included, the coefficient for band 7 is not significant, another reason why we left it out.
After, predict the VCF map. The linear model does not know that the VCF can only be between 0 and 100 and therefore we transformed all values below 0 to 0.
# Predict the map
vcfMap <- predict(alldata[[1:5]], model = model)
# Change negative values to 0 because VCF can just be between 0 and 100
vcfMap[vcfMap < 0] <- 0
par(mfrow = c(1, 2))
plot(vcfMap, main = "Predicted VCF map")
plot(alldata$VCF, main = "Actual VCF map")
par(mfrow=c(1,1))
The two plots above are the predicted VCF map (left), and the actual VCF map (right). You can see that they are quite similar, however the lower values occur a bit more in the actual plot. This means that probably the land cover class ‘forest’ is easier to predict and therefore more often correct in the predicted map.
Calculate the RMSE and show a histogram of the data.
squared_differences <- (alldata$VCF-vcfMap)^2
plot(squared_differences)
mean_squared = cellStats(squared_differences, stat = 'mean')
RMSE = sqrt(mean_squared)
print(paste("The RMSE is:", RMSE))
## [1] "The RMSE is: 8.44849912661772"
Load training zones and make a raster of the class codes.
load(file.path(inFolder,'trainingPoly.rda'))
# Give the classes a code and make a raster of it
trainingPoly@data$Code <- as.numeric(trainingPoly@data$Class)
trainingPoly@data
## OBJECTID Class Code
## 0 1 wetland 3
## 1 2 wetland 3
## 2 3 wetland 3
## 3 4 wetland 3
## 4 5 wetland 3
## 5 6 forest 2
## 6 7 forest 2
## 7 8 forest 2
## 8 9 forest 2
## 9 10 forest 2
## 10 11 cropland 1
## 11 12 cropland 1
## 12 13 cropland 1
## 13 14 cropland 1
## 14 15 cropland 1
## 15 16 cropland 1
classes <- rasterize(trainingPoly, squared_differences, field='Code')
This raster can then be used to perform the zonal statistics and calculate the RMSE for each class.
# Perform zonal statistics to calculate RMSE of each zone
mat_mean <- zonal(squared_differences, classes, fun = 'mean', na.rm=T)
sqrt_vector <- sqrt(mat_mean[,2])
stat_df <- as.data.frame(mat_mean)
stat_df$RMSE <- sqrt_vector
stat_df$class <- c("cropland","forest","wetland")
stat_df
## zone mean RMSE class
## 1 1 86.33781 9.291814 cropland
## 2 2 24.87072 4.987055 forest
## 3 3 141.92961 11.913421 wetland
worst_zone <- stat_df[which(stat_df$RMSE == max(stat_df$RMSE)), ]
worst_zone
## zone mean RMSE class
## 3 3 141.9296 11.91342 wetland
You can see in the matrix that zone 1 and 3 (cropland and wetland respectively) have on average a higher RMSE than zone 2 (forest). This was expected when comparing the maps, as forests were usually correctly predicted and the other 2 zones gave more differences. Wetland showed the highest average RMSE.